


Towards deployment-centric multimodal AI beyond vision and language

Liu, Xianyuan, Zhang, Jiayang, Zhou, Shuo, van der Plas, Thijs L., Vijayaraghavan, Avish, Grishina, Anastasiia, Zhuang, Mengdie, Schofield, Daniel, Tomlinson, Christopher, Wang, Yuhan, Li, Ruizhe, van Zeeland, Louisa, Tabakhi, Sina, Demeocq, Cyndie, Li, Xiang, Das, Arunav, Timmerman, Orlando, Baldwin-McDonald, Thomas, Wu, Jinge, Bai, Peizhen, Sahili, Zahraa Al, Alwazzan, Omnia, Do, Thao N., Suvon, Mohammod N. I., Wang, Angeline, Cipolina-Kun, Lucia, Moretti, Luigi A., Farndale, Lucas, Jain, Nitisha, Efremova, Natalia, Ge, Yan, Varela, Marta, Lam, Hak-Keung, Celiktutan, Oya, Evans, Ben R., Coca-Castro, Alejandro, Wu, Honghan, Abdallah, Zahraa S., Chen, Chen, Danchev, Valentin, Tkachenko, Nataliya, Lu, Lei, Zhu, Tingting, Slabaugh, Gregory G., Moore, Roger K., Cheung, William K., Charlton, Peter H., Lu, Haiping

arXiv.org Artificial Intelligence

Multimodal artificial intelligence (AI) integrates diverse types of data via machine learning to improve understanding, prediction, and decision-making across disciplines such as healthcare, science, and engineering. However, most multimodal AI advances focus on models for vision and language data, while their deployability remains a key challenge. We advocate a deployment-centric workflow that incorporates deployment constraints early to reduce the likelihood of undeployable solutions, complementing data-centric and model-centric approaches. We also emphasise deeper integration across multiple levels of multimodality and multidisciplinary collaboration to significantly broaden the research scope beyond vision and language. To facilitate this approach, we identify common multimodal-AI-specific challenges shared across disciplines and examine three real-world use cases: pandemic response, self-driving car design, and climate change adaptation, drawing expertise from healthcare, social science, engineering, science, sustainability, and finance. By fostering multidisciplinary dialogue and open research practices, our community can accelerate deployment-centric development for broad societal impact.
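The abstract's central prescription, incorporating deployment constraints early rather than after model development, can be made concrete with a small sketch. The code below is not from the paper; every class, field, and threshold is a hypothetical illustration of screening candidate multimodal models against deployment constraints at the start of a workflow, so undeployable options are filtered out before training effort is spent.

```python
# Illustrative sketch (assumptions, not the paper's method): screen candidate
# multimodal models against deployment constraints early in the workflow.
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    modalities: tuple        # e.g. ("imaging", "clinical_text")
    params_millions: float   # model size
    latency_ms: float        # estimated inference latency
    needs_gpu: bool


@dataclass
class DeploymentConstraints:
    max_params_millions: float
    max_latency_ms: float
    gpu_available: bool
    required_modalities: set


def deployable(c: Candidate, d: DeploymentConstraints) -> bool:
    """Reject candidates that cannot meet the deployment constraints."""
    return (
        c.params_millions <= d.max_params_millions
        and c.latency_ms <= d.max_latency_ms
        and (d.gpu_available or not c.needs_gpu)
        and d.required_modalities.issubset(set(c.modalities))
    )


# Hypothetical example: an edge-deployed pandemic-response triage model.
constraints = DeploymentConstraints(
    max_params_millions=500, max_latency_ms=200, gpu_available=False,
    required_modalities={"imaging", "clinical_text"},
)
candidates = [
    Candidate("fusion-large", ("imaging", "clinical_text"), 7000, 900, True),
    Candidate("fusion-lite", ("imaging", "clinical_text"), 120, 80, False),
]
print([c.name for c in candidates if deployable(c, constraints)])
# -> ['fusion-lite']
```

The point of the sketch is only that the constraint check runs before any model is trained, which is what distinguishes a deployment-centric workflow from purely data-centric or model-centric ones.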


Translating Multimodal AI into Real-World Inspection: TEMAI Evaluation Framework and Pathways for Implementation

Li, Zehan, Deng, Jinzhi, Ma, Haibing, Zhang, Chi, Xiao, Dan

arXiv.org Artificial Intelligence

This paper introduces the Translational Evaluation of Multimodal AI for Inspection (TEMAI) framework, bridging multimodal AI capabilities with industrial inspection implementation. Adapting translational research principles from healthcare to industrial contexts, TEMAI establishes three core dimensions: Capability (technical feasibility), Adoption (organizational readiness), and Utility (value realization). The framework demonstrates that technical capability alone yields limited value without corresponding adoption mechanisms. TEMAI incorporates specialized metrics, including the Value Density Coefficient, as well as structured implementation pathways. Empirical validation through retail and photovoltaic inspection implementations revealed significant differences in value-realization patterns despite similar capability reduction rates, confirming the framework's effectiveness across diverse industrial sectors while highlighting the importance of industry-specific adaptation strategies.

Keywords: Multimodal AI, Industrial Inspection, Translational Framework, TEMAI

Industrial inspection tasks are fundamental to ensuring operational continuity and safety in manufacturing sectors, serving as a cornerstone for preventive maintenance and risk mitigation. These tasks, however, are plagued by systemic inefficiencies, including labor-intensive workflows, hazardous working environments (e.g., high-temperature zones or toxic gas exposure), and heavy reliance on empirical knowledge that is difficult to standardize or transfer across industries [1]. Despite incremental advances in automation technologies such as drones, AR-assisted devices, and IoT-enabled sensors, the integration of these tools into inspection workflows has yielded limited returns due to fragmented deployment, high implementation costs, and insufficient interoperability between hardware and software systems [2]. For instance, while drones have reduced human exposure to dangerous environments in power grid inspections, their operational scope remains constrained by battery life and data-processing bottlenecks [3].
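The excerpt names TEMAI's three dimensions but gives no formulas for scoring them or for the Value Density Coefficient. The sketch below is therefore purely hypothetical: the [0, 1] scale, the weights, and the geometric-mean combination are all assumptions chosen to illustrate one property the abstract does state, namely that high capability yields little value without adoption.

```python
# Hypothetical scoring sketch for TEMAI's three dimensions. The paper excerpt
# does not define how Capability, Adoption, and Utility are quantified;
# everything here is an illustrative assumption, not the authors' metric
# and not their Value Density Coefficient.

def temai_score(capability: float, adoption: float, utility: float) -> float:
    """Combine the three dimension scores (each assumed to lie in [0, 1]).

    A geometric mean is used so that a near-zero score in any single
    dimension drags the total down, mirroring the claim that technical
    capability alone yields limited value without adoption mechanisms.
    """
    for v in (capability, adoption, utility):
        if not 0.0 <= v <= 1.0:
            raise ValueError("dimension scores are assumed to lie in [0, 1]")
    return (capability * adoption * utility) ** (1 / 3)


# High capability but poor organizational adoption -> low overall value.
print(round(temai_score(0.9, 0.1, 0.8), 3))  # 0.416
# A balanced profile scores higher despite a lower capability peak.
print(round(temai_score(0.7, 0.7, 0.7), 3))  # 0.7
```

An additive weighted sum would not show this effect, since strong capability could compensate for absent adoption; that is the only reason a multiplicative combination is used in this sketch.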


ACE, Action and Control via Explanations: A Proposal for LLMs to Provide Human-Centered Explainability for Multimodal AI Assistants

Watkins, Elizabeth Anne, Moss, Emanuel, Manuvinakurike, Ramesh, Shi, Meng, Beckwith, Richard, Raffa, Giuseppe

arXiv.org Artificial Intelligence

In this short paper we address issues related to building multimodal AI systems for human performance support in manufacturing domains. We make two contributions: first, we identify challenges in the participatory design and training of such systems; second, to address these challenges, we propose the ACE paradigm: "Action and Control via Explanations". Specifically, we suggest that LLMs can produce explanations in the form of human-interpretable "semantic frames", which in turn enable end users to provide the data the AI system needs to align its multimodal models and representations, including computer vision, automatic speech recognition, and document inputs. By using LLMs to "explain" through semantic frames, ACE will help the human and the AI system collaborate, together building a more accurate model of human activities and behaviors, and ultimately more accurate predictive outputs for better task support and better outcomes for human users performing manual tasks.
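The abstract does not specify what a semantic frame contains, so the sketch below is an assumed schema: the fields, the example frame, and the correction loop are illustrative choices showing how an LLM-rendered explanation could let an end user confirm or correct the system's belief, turning that correction into alignment data for the multimodal models.

```python
# Illustrative sketch of the "semantic frame" explanations ACE proposes.
# The paper gives no schema; every field and value here is a hypothetical
# example of the confirm-or-correct loop the abstract describes.
from dataclasses import dataclass, field


@dataclass
class SemanticFrame:
    action: str                    # what the system believes the worker is doing
    actor: str
    objects: list = field(default_factory=list)
    source_modalities: list = field(default_factory=list)  # e.g. vision, ASR
    confidence: float = 0.0


def explain(frame: SemanticFrame) -> str:
    """Render the frame as a human-interpretable explanation."""
    return (f"I think {frame.actor} is performing '{frame.action}' "
            f"on {', '.join(frame.objects)} "
            f"(based on {', '.join(frame.source_modalities)}; "
            f"confidence {frame.confidence:.0%}). Is that right?")


def apply_user_correction(frame: SemanticFrame, corrected_action: str) -> dict:
    """A user's correction becomes a labeled example for realignment."""
    return {"predicted": frame.action, "corrected": corrected_action,
            "modalities": frame.source_modalities}


frame = SemanticFrame(action="tighten bolt", actor="the operator",
                      objects=["torque wrench", "flange"],
                      source_modalities=["computer vision", "speech"],
                      confidence=0.72)
print(explain(frame))
print(apply_user_correction(frame, corrected_action="loosen bolt"))
```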


Global Big Data Conference

#artificialintelligence

Earlier this month, researchers at the Allen Institute for AI -- a nonprofit founded by the late Microsoft cofounder Paul Allen -- released an interactive demo of a system they describe as part of a "new generation" of AI applications that can analyze, search across, and respond to questions about videos "at scale." Called Merlot Reserve, the system "watched" 20 million YouTube videos to learn the relationships between images, sounds, and subtitles, allowing it to answer questions such as "What meal does the person in the video want to eat?" or "Has the boy in this video swam in the ocean before?" Systems that can process and relate information from audio, visuals, and text have been around for years, and they continue to improve in their ability to understand the world more like humans do. San Francisco research lab OpenAI's DALL-E, released in 2021, can generate images of objects -- real or imagined -- from simple text descriptions like "an armchair in the shape of an avocado."